CNV-74265: Hot cluster CI update #4099
@galkremer1: This pull request references CNV-74265, which is a valid Jira issue.
Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: galkremer1
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
This PR adds CI environment controller assets, new Helm charts, composite actions for requesting and releasing environments, hot-cluster GitHub workflows, IBM Cloud teardown automation, and supporting scripts, docs, and test coverage.
Changes
- CI Environment Controller and Test Stack
- Hot Cluster Workflows
- Docs, Helpers, and Local Tooling
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
✅ Jira Validation Passed CNV-74265
Actionable comments posted: 4
Note: Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
ci-scripts/images/setup-arc-runner-image.sh (1)
121-127: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Restore IMAGE_REF= stdout contract
Line 126 is commented, so the script no longer emits the IMAGE_REF= line promised at the top of the file. This can break callers that parse stdout.
Proposed fix
     if [[ -n "${ARC_RUNNER_IMAGE_FILE:-}" ]]; then
       printf '%s\n' "${IMAGE_REF}" > "${ARC_RUNNER_IMAGE_FILE}"
       echo "Wrote ${ARC_RUNNER_IMAGE_FILE}"
     fi
    -#echo "IMAGE_REF=${IMAGE_REF}"
    +echo "IMAGE_REF=${IMAGE_REF}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/images/setup-arc-runner-image.sh` around lines 121 - 127, The echo statement that outputs the IMAGE_REF variable to stdout is currently commented out on the line showing `#echo "IMAGE_REF=${IMAGE_REF}"`. Uncomment this line to restore the stdout contract that callers depend on for parsing the image reference.
ci-scripts/images/setup-ci-env-runner-image.sh (1)
118-124: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Missing IMAGE_REF= output breaks controller install flow
install-ci-env-controller.sh explicitly greps this script’s stdout for IMAGE_REF= and exits on empty output, so Line 123 being commented is a hard failure path. Also, the optional file-output variable name is inconsistent with the script header.
Proposed fix
    -# TODO: Better handling of passing the fqdn image name to the caller
    -if [[ -n "${CI_ENV_RUNNER_IMAGE_FILE:-}" ]]; then
    -  printf '%s\n' "${IMAGE_REF}" > "${CI_ENV_RUNNER_IMAGE_FILE}"
    -  echo "Wrote ${CI_ENV_RUNNER_IMAGE_FILE}"
    -fi
    -#echo "IMAGE_REF=${IMAGE_REF}"
    +# TODO: Better handling of passing the fqdn image name to the caller
    +echo "IMAGE_REF=${IMAGE_REF}"
    +CI_ENV_IMAGE_FILE="${CI_ENV_CONTROLLER_IMAGE_FILE:-${CI_ENV_RUNNER_IMAGE_FILE:-}}"
    +if [[ -n "${CI_ENV_IMAGE_FILE}" ]]; then
    +  printf '%s\n' "${IMAGE_REF}" > "${CI_ENV_IMAGE_FILE}"
    +  echo "Wrote ${CI_ENV_IMAGE_FILE}"
    +fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/images/setup-ci-env-runner-image.sh` around lines 118 - 124, Uncomment the line that echoes the IMAGE_REF variable (the `echo "IMAGE_REF=${IMAGE_REF}"` line) since the install-ci-env-controller.sh script explicitly greps stdout for the IMAGE_REF= output and will fail if this line remains commented out. Additionally, review the variable name CI_ENV_RUNNER_IMAGE_FILE used in the optional file-output block and verify it matches the naming convention documented in the script header for consistency.
.github/workflows/hot-cluster-e2e.yml (1)
93-97: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Avoid broad secrets: inherit in reusable workflow calls.
This passes all available secrets to the called workflow. Limit to explicit secrets required by hot-cluster-e2e-run.yml to enforce least privilege.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/hot-cluster-e2e.yml around lines 93 - 97, The `secrets: inherit` statement passes all available secrets to the hot-cluster-e2e-run.yml reusable workflow, violating the principle of least privilege. Replace `secrets: inherit` with an explicit mapping that only includes the specific secrets required by hot-cluster-e2e-run.yml. Use the format `secrets:` followed by individual secret declarations (e.g., `SECRET_NAME: ${{ secrets.SECRET_NAME }}`) for each secret actually needed by the called workflow, rather than inheriting all secrets indiscriminately.
Source: Linters/SAST tools
🟠 Major comments (37)
cypress/tests/gating/poc-check-tab-yaml.cy.ts-107-232 (1)
107-232: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Remove commented-out test code.
Lines 107-232 contain 126 lines of commented-out test code for Templates, InstanceTypes, Bootable volumes, MigrationPolicies, and Checkups. Committed commented-out code is a maintainability issue:
- It clutters the codebase and reduces readability
- It creates confusion about whether the code should be used
- Version control already preserves deleted code if it's needed later
If these tests are work-in-progress, track them in a Jira issue instead. If they're no longer needed, remove them.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cypress/tests/gating/poc-check-tab-yaml.cy.ts` around lines 107 - 232, Remove all the commented-out test code from the file. Delete the entire block of commented code that includes the describe blocks for 'Check Templates page', 'Check InstanceTypes tabs', 'Check Bootable volumes page', 'Check MigrationPolicies page', and 'Check Checkups tabs'. This commented-out code clutters the codebase and reduces maintainability. If these tests are needed in the future, they can be recovered from version control history.
cypress/tests/gating/poc-check-tab-yaml.cy.ts-18-18 (1)
18-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Replace arbitrary time-based waits with condition-based waits.
Lines 18 and 30 use cy.wait() with hardcoded timeouts (15 and 10 seconds), which is an anti-pattern in Cypress tests. According to the coding guidelines, you should use intercept() and wait() to handle network requests with appropriate timeouts.
Instead of arbitrary waits, prefer:
- Waiting for specific DOM conditions (cy.get(...).should('exist'))
- Intercepting network requests and waiting for them to complete
- Using built-in Cypress retry-ability with assertions
♻️ Recommended approach
For line 18, instead of:
    - cy.wait(15 * SECOND);
Wait for a specific condition that indicates the VM has processed the start action:
    cy.get(sel.vmStatusOnOverview, { timeout: 15 * SECOND }).should('be.visible');
For line 30, the subsequent test already checks the status, so this wait might be redundant or could be replaced with a more specific assertion.
Also applies to: 30-30
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cypress/tests/gating/poc-check-tab-yaml.cy.ts` at line 18, Replace the arbitrary time-based waits using cy.wait(15 * SECOND) at line 18 and cy.wait(10 * SECOND) at line 30 with condition-based waits that verify specific DOM elements or network requests. For line 18, replace the cy.wait() call with cy.get(sel.vmStatusOnOverview, { timeout: 15 * SECOND }).should('be.visible') to wait for the VM status element to actually appear rather than waiting arbitrarily. For line 30, either remove the wait if the subsequent test already validates the expected state, or replace it with a specific DOM or network condition assertion that confirms the necessary precondition has been met.
Source: Coding guidelines
ci-scripts/install-oc-client.sh-22-24 (1)
22-24: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Prefer OpenShift version fields before masterKubeVersion when deriving OPENSHIFT_VERSION.
At Line 22, prioritizing masterKubeVersion can resolve to a Kubernetes version (e.g., 1.xx), which then builds an invalid OCP client URL at Line 46 (stable-1.xx). Use .openshiftVersion/.version first, then fallback.
Also applies to: 46-46
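The fallback order described above can be sketched as a small Bash helper. This is only an illustration under stated assumptions: `pick_openshift_version` is a hypothetical name, the candidate values are assumed to have been extracted from the cluster JSON already, and the "major version of at least 4" test is a heuristic introduced here to reject Kubernetes-style values such as 1.xx. It is not taken from install-oc-client.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first candidate that looks like an
# OpenShift version, normalised to MAJOR.MINOR. Candidates are passed in
# priority order (OpenShift-specific fields first, masterKubeVersion last).
pick_openshift_version() {
  local candidate
  for candidate in "$@"; do
    candidate="${candidate#v}"      # drop a leading "v", if any
    candidate="${candidate%%_*}"    # drop a suffix such as "_1545_openshift"
    # Assumption: a major version below 4 is a Kubernetes version, not OCP.
    if [[ "${candidate}" =~ ^([0-9]+)\.([0-9]+) ]] \
        && (( 10#${BASH_REMATCH[1]} >= 4 )); then
      printf '%s.%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
      return 0
    fi
  done
  return 1   # nothing usable; the caller should fail instead of guessing
}
```

A caller might pass the OpenShift-specific values first and the masterKubeVersion value last, and abort when the function returns non-zero rather than build a `stable-1.xx` URL.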
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/install-oc-client.sh` around lines 22 - 24, The order of field prioritization in the fallback chain for deriving OPENSHIFT_VERSION is incorrect. Currently, masterKubeVersion is being checked first, which can resolve to a Kubernetes version (e.g., 1.xx) instead of an OpenShift version, causing an invalid OCP client URL to be constructed later (e.g., stable-1.xx at line 46). Reorder the fallback chain to prioritize openshiftVersion and version first, then fall back to masterKubeVersion as the last option. This ensures that OpenShift-specific version fields are used before the generic Kubernetes version field.
ci-scripts/check-roks-cluster-state.sh-26-30 (1)
26-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don’t swallow ibmcloud errors into {} in the polling loop.
Falling back to "{}" masks auth/cluster-not-found/API failures and turns hard errors into long timeouts. Fail fast on command failure (or after a small retry budget) and surface stderr.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/check-roks-cluster-state.sh` around lines 26 - 30, The ibmcloud oc cluster get command on the CLUSTER_JSON assignment is currently suppressing errors by falling back to an empty JSON object with `|| echo "{}"`. Remove this fallback mechanism so that authentication failures, cluster-not-found errors, and API failures cause the script to fail immediately rather than masking the actual problem. Let the command's stderr be visible to the user and either fail fast on command failure or implement a small retry budget before giving up, ensuring that hard errors are surfaced instead of being hidden in a long timeout loop.
ci-scripts/check-roks-cluster-state.sh-15-16 (1)
15-16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Validate MAX_WAIT/INTERVAL as positive integers before the loop.
INTERVAL=0 causes a non-progressing loop (ELAPSED never increases), and non-numeric values can break sleep/arithmetic paths.
Also applies to: 45-46
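A minimal sketch of such a guard in Bash follows. The helper names `is_positive_int` and `validate_wait_settings` are hypothetical and not part of check-roks-cluster-state.sh; only the variable names MAX_WAIT and INTERVAL come from the comment above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for a positive base-10 integer.
# Rejects "", "0", negative numbers, non-numeric text, and leading zeros
# (which Bash arithmetic would otherwise read as octal).
is_positive_int() {
  [[ "${1:-}" =~ ^[1-9][0-9]*$ ]]
}

# Hypothetical helper: check both settings and report the first bad one.
validate_wait_settings() {
  local name
  for name in MAX_WAIT INTERVAL; do
    if ! is_positive_int "${!name:-}"; then
      echo "ERROR: ${name} must be a positive integer, got '${!name:-}'" >&2
      return 1
    fi
  done
}
```

Calling `validate_wait_settings || exit 1` right after the defaults are assigned would stop the script before the polling loop starts, so `INTERVAL=0` can no longer spin forever.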
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/check-roks-cluster-state.sh` around lines 15 - 16, The MAX_WAIT and INTERVAL variables are set with default values but lack validation before being used in the loop, which can cause issues like infinite loops when INTERVAL is zero or script failures with non-numeric values. Add validation logic after the variable assignments (at lines 15-16 and also at lines 45-46 as indicated) to ensure both MAX_WAIT and INTERVAL are positive integers, exiting with an error message if validation fails. This validation should occur before any loop or arithmetic operations that depend on these variables.
ci-scripts/_cluster-helpers.sh-35-43 (1)
35-43: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Use the route’s targetPort when rewriting internal service URLs.
The current rewrite picks service.spec.ports[0].port, which can mismatch the route backend port. If the route targets a non-first port, the rewritten URL is wrong and downstream CLI downloads fail.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/_cluster-helpers.sh` around lines 35 - 43, The current implementation extracts the service name from the route but ignores the route's targetPort specification, instead always using the first port from the service spec. Modify the jq query that extracts route_info to also capture the route's targetPort (spec.port.targetPort on the Route), then use this extracted targetPort value in the final URL construction instead of querying the service for spec.ports[0].port. This ensures the rewritten URL targets the correct backend port that the route actually specifies.
ci-scripts/arc/arc-runner-rbac.yaml-27-45 (1)
27-45: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Reduce runner RBAC blast radius (cluster-wide write access is too broad).
Granting cluster-scoped patch on virtualmachines plus broad cluster reads to the PR runner SA materially increases impact if a runner job is abused. Scope write permissions to the ephemeral test namespace via namespaced Role/RoleBinding, and keep cluster-wide access strictly read-only and minimal.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/arc/arc-runner-rbac.yaml` around lines 27 - 45, Remove the cluster-wide write access by eliminating the 'patch' verb from the virtualmachines resource rule in this ClusterRole, retaining only the read-only verbs 'get', 'list', and 'watch'. Create a separate namespaced Role bound to the test namespace that grants the 'patch' verb on virtualmachines resources, ensuring Cypress cy.exec runs can still patch virtualmachines but only within the ephemeral test namespace, thus limiting the RBAC blast radius if the runner is compromised.
ci-scripts/arc/arc-runner-scale-set.pod.yaml-23-23 (1)
23-23: ⚠️ Potential issue | 🟠 Major
Pin runner image to an immutable tag or digest instead of latest.
Using ghcr.io/actions/actions-runner:latest makes CI behavior non-deterministic and can pull unreviewed upstream changes between runs. Replace with a specific version tag or image digest.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/arc/arc-runner-scale-set.pod.yaml` at line 23, Replace the image tag "latest" in the ghcr.io/actions/actions-runner image specification with a specific version tag or an immutable image digest. Instead of using the "latest" tag which pulls unpredictable versions, pin the image to a concrete version (e.g., a specific release tag like "v2.315.0") or use the full image digest hash to ensure reproducible and deterministic CI behavior across runs.
ci-scripts/install-oc-client.sh-48-54 (1)
48-54: ⚠️ Potential issue | 🟠 Major
Download and verify the oc archive checksum before extraction.
The script lacks integrity validation. OpenShift's mirrors provide sha256sum.txt and sha256sum.txt.gpg files in the same directory. Download the checksum file, verify its GPG signature with the Red Hat release key, then validate the archive hash before extraction:
    curl -fsSL -o "${ARCHIVE}" "${URL}"
    curl -fsSL -o "${CHECKSUMS}" "${URL%/*}/sha256sum.txt"
    curl -fsSL -o "${CHECKSUMS}.gpg" "${URL%/*}/sha256sum.txt.gpg"
    gpg --verify "${CHECKSUMS}.gpg" "${CHECKSUMS}"
    grep "openshift-client-linux.tar.gz" "${CHECKSUMS}" | sha256sum -c -
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/install-oc-client.sh` around lines 48 - 54, The script downloads the oc archive but does not verify its integrity before extraction, creating a security risk. After the successful curl download of ARCHIVE from URL, add steps to download the sha256sum.txt and sha256sum.txt.gpg files from the same URL directory, verify the GPG signature of the checksums file using the Red Hat release key, and then validate the archive hash against the checksums before proceeding with the tar extraction. Only allow the tar extraction and install commands to execute if the checksum verification succeeds.
ci-scripts/images/ci-env-runner/helm/ci-test-stack-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Symlinked chart path is causing CI YAML/Helm parse failures
The new symlinked chart location is very likely what triggered the repeated parser errors in CI (invalid YAML at multiple template lines). This duplicates Helm templates under a path that build-time YAML checks appear to scan as plain YAML. Please either exclude this path from the YAML parse step or avoid exposing the chart tree here at repo level.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-scripts/images/ci-env-runner/helm/ci-test-stack` at line 1, The symbolic link pointing from ci-env-runner/helm/ci-test-stack to ../../../helm/ci-test-stack is being picked up and scanned by the CI build-time YAML validation checks as plain YAML files, which is causing parse failures on Helm template syntax. Either remove this symlink entirely and use direct file references where needed, or configure the YAML parsing step in your CI pipeline to explicitly exclude the ci-scripts/images/ci-env-runner/helm directory from validation checks so symlinked Helm charts are not treated as raw YAML.
Source: Pipeline failures
.github/workflows/ibmc-cluster-setup.yml-130-138 (1)
130-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Avoid direct ${{ inputs.* }} expansion inside shell commands.
Direct template expansion in run: allows shell-level injection via crafted workflow inputs. Bind inputs via env: and use shell variables instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ibmc-cluster-setup.yml around lines 130 - 138, The workflow is directly expanding workflow inputs using ${{ inputs.* }} syntax within the shell command, which creates a potential security vulnerability for shell injection. Move the input parameters (worker_count, worker_flavor, and openshift_version) into environment variables using the env: section above the run: step, then reference those environment variables using shell variable syntax (e.g., $WORKER_COUNT instead of ${{ inputs.worker_count }}) within the ibmcloud oc cluster create classic command to safely pass the inputs to the shell.
Source: Linters/SAST tools
.github/workflows/ibmc-cluster-auto-teardown.yml-153-153 (1)
153-153:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRemove template expansion from inline JavaScript.
Injecting
${{ env.CLUSTER_NAME }}directly into JS source introduces script-injection risk. Readprocess.env.CLUSTER_NAMEinside the script instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ibmc-cluster-auto-teardown.yml at line 153, The inline JavaScript code at line 153 in ibmc-cluster-auto-teardown.yml has a script injection vulnerability because cluster_name is being set with the GitHub Actions template expansion syntax ${{ env.CLUSTER_NAME }} directly in the JavaScript source. Remove this template expansion from the cluster_name assignment and instead access the environment variable directly within the script using process.env.CLUSTER_NAME to safely read the environment variable value at runtime.Source: Linters/SAST tools
.github/workflows/hot-cluster-e2e-run.yml-29-29 (1)
29-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Include github.run_attempt in ephemeral resource names.
TEST_NS and CI_ENV_CM currently key only on github.run_id, so reruns can collide with stale resources from earlier attempts.
Suggested fix
    - TEST_NS: kubevirt-plugin-ci-test-${{ github.run_id }}
    + TEST_NS: kubevirt-plugin-ci-test-${{ github.run_id }}-${{ github.run_attempt }}
    @@
    - CI_ENV_CM: ci-env-${{ github.run_id }}
    + CI_ENV_CM: ci-env-${{ github.run_id }}-${{ github.run_attempt }}
Also applies to: 35-35
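For scripts that build these names in the shell rather than in workflow YAML, the same idea can be sketched with the default GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT environment variables. The helper name `ephemeral_name` is hypothetical, and the 63-character guard is an addition made here because Kubernetes namespace names must fit in a DNS label; it is applied to both names only as a conservative bound.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build "<prefix>-<run_id>-<run_attempt>" so that a
# re-run of the same workflow run gets its own resources.
ephemeral_name() {
  local prefix="$1"
  # ":?" aborts with an error if either variable is unset or empty.
  local name="${prefix}-${GITHUB_RUN_ID:?}-${GITHUB_RUN_ATTEMPT:?}"
  # Namespace names are limited to 63 characters (DNS label).
  if (( ${#name} > 63 )); then
    echo "ERROR: '${name}' is longer than 63 characters" >&2
    return 1
  fi
  printf '%s\n' "${name}"
}
```

With this sketch, `TEST_NS="$(ephemeral_name kubevirt-plugin-ci-test)"` and `CI_ENV_CM="$(ephemeral_name ci-env)"` would change on every attempt of the same run.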
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/hot-cluster-e2e-run.yml at line 29, The TEST_NS and CI_ENV_CM ephemeral resource names only use github.run_id, which causes collisions when a workflow is retried since multiple attempts share the same run_id. Update both the TEST_NS variable definition (around line 29) and the CI_ENV_CM variable definition (around line 35) to include github.run_attempt in addition to github.run_id, ensuring each workflow attempt gets a unique resource name by appending the attempt number to the existing identifier.
.github/workflows/ibmc-cluster-teardown.yml-34-34 (1)
34-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Job timeout is shorter than the deletion wait window.
The job times out at 60 minutes, but the script intentionally waits up to 120 minutes. This can terminate teardown mid-flight and produce noisy failures.
Also applies to: 79-96
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ibmc-cluster-teardown.yml at line 34, The job timeout is set to 60 minutes but the deletion script waits up to 120 minutes, causing premature job termination. Increase the timeout-minutes value at line 34 from 60 to at least 120 to accommodate the full deletion wait window. Also check and apply the same fix to the timeout settings in lines 79-96 where this issue also applies to ensure all teardown jobs have sufficient time to complete.
.github/actions/ci-env-request/action.yml-34-49 (1)
34-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Make trigger ConfigMap creation idempotent.
Using oc create causes immediate failure on retries/reruns when the ConfigMap already exists, which blocks environment provisioning.
Suggested fix
    - cat <<EOF | oc create -f -
    + cat <<EOF | oc apply -f -
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: ${{ inputs.configmap-name }}
        namespace: ${{ inputs.ci-env-namespace }}
        labels:
          ci.kubevirt-plugin/type: test-environment
      data:
        desired-state: "present"
        plugin-image: "${{ inputs.plugin-image }}"
        test-namespace: "${{ inputs.test-namespace }}"
      EOF
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/actions/ci-env-request/action.yml around lines 34 - 49, The "Create trigger ConfigMap" step uses `oc create` which will fail if the ConfigMap already exists, breaking idempotency on retries or reruns. Replace the `oc create -f -` command with `oc apply -f -` to make the operation idempotent, allowing the ConfigMap to be created if it doesn't exist or updated if it already does.
.github/workflows/hot-cluster-e2e-run.yml-289-293 (1)
289-293: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Do not inject ${{ inputs.test_project }} directly into bash code.
This is template-injection prone. Bind test_project via env: and compare the environment variable in shell.
Suggested fix
      - name: Run Playwright gating tests
        env:
          BRIDGE_BASE_ADDRESS: ${{ steps.ci-env.outputs.bridge-base-address }}
    +     TEST_PROJECT: ${{ inputs.test_project }}
        run: |
    -     if [[ "${{ inputs.test_project }}" == "features" ]]; then
    +     if [[ "${TEST_PROJECT}" == "features" ]]; then
            npm run test-playwright-headless -- --project=features
          else
            npm run test-playwright-headless -- --project=gating
          fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/hot-cluster-e2e-run.yml around lines 289 - 293, The direct injection of ${{ inputs.test_project }} into the bash conditional is a template-injection vulnerability. Instead, add an `env:` section at the step level to bind the input value to an environment variable (e.g., TEST_PROJECT), then replace the direct template injection ${{ inputs.test_project }} with the environment variable reference $TEST_PROJECT in the conditional comparison.
Source: Linters/SAST tools
.github/workflows/ibmc-cluster-auto-teardown.yml-35-35 (1)
35-35: ⚠️ Potential issue | 🟠 Major
Pin third-party actions to immutable commit SHAs.
Replace version tags with commit SHAs:
- Line 35: actions/github-script@v8
- Line 86: IBM/actions-ibmcloud-cli@v1
- Line 144: actions/github-script@v8
Version tags are mutable and can change unexpectedly, compromising build determinism and auditability. Pinning to specific commit SHAs ensures reproducible and secure CI/CD workflows.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ibmc-cluster-auto-teardown.yml at line 35, Replace the mutable version tag references with immutable commit SHAs for all third-party GitHub Actions in the workflow file. For the actions/github-script action at line 35 and line 144, replace `@v8` with the corresponding commit SHA. For the IBM/actions-ibmcloud-cli action at line 86, replace `@v1` with the corresponding commit SHA. Each uses field should reference a specific commit hash instead of a version tag to ensure reproducible and secure CI/CD workflows.
Source: Linters/SAST tools
.github/actions/ci-env-release/action.yml-49-58 (1)
49-58: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Do not delete the trigger ConfigMap when cleanup times out.
Current flow deletes the ConfigMap even after timeout, which can prevent the controller from finishing teardown and leak cluster resources.
Suggested fix
      echo "Waiting for controller to clean up..."
      INTERVAL=5
      ELAPSED=0
    + CLEANED=false
      while true; do
        STATUS="$(oc get configmap "${CM_NAME}" -n "${CM_NS}" \
          -o jsonpath='{.data.status}' 2>/dev/null || echo "")"
        if [[ "${STATUS}" == "cleaned" ]]; then
          echo "Cleanup complete."
    +     CLEANED=true
          break
        fi
    @@
        if (( ELAPSED >= TIMEOUT )); then
          echo "::warning::Timed out waiting for controller cleanup (status=${STATUS})"
          break
        fi
    @@
      done
    - oc delete configmap "${CM_NAME}" -n "${CM_NS}" 2>/dev/null || true
    + if [[ "${CLEANED}" == "true" ]]; then
    +   oc delete configmap "${CM_NAME}" -n "${CM_NS}" 2>/dev/null || true
    + else
    +   echo "::warning::Skipping ConfigMap deletion because cleanup did not reach status=cleaned"
    + fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/actions/ci-env-release/action.yml around lines 49 - 58, The ConfigMap deletion using oc delete configmap "${CM_NAME}" -n "${CM_NS}" is being executed unconditionally after the cleanup loop, even when the loop times out and breaks due to ELAPSED >= TIMEOUT. To fix this, only execute the ConfigMap deletion when the cleanup completes successfully without timing out. You can achieve this by either moving the deletion inside the loop, before the break statement that runs when the status is "cleaned", or by wrapping the deletion in a condition that ensures it only runs when the timeout was not reached. This prevents resource leaks by allowing the controller to complete its teardown without the ConfigMap being prematurely removed.
.github/workflows/ibmc-cluster-setup.yml-50-53 (1)
50-53: ⚠️ Potential issue | 🟠 Major
Pin external actions to commit SHAs.
Floating refs (@v5, @v1) are mutable and vulnerable to tag manipulation and malicious code injection, weakening workflow supply-chain integrity. Pin each action to its full-length commit SHA with the version tag as a comment for readability:
    uses: actions/checkout@<full-sha> # v5
    uses: IBM/actions-ibmcloud-cli@<full-sha> # v1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ibmc-cluster-setup.yml around lines 50 - 53, Pin the GitHub actions to their full-length commit SHAs for security. Replace the floating version refs in the uses clauses: change actions/checkout@v5 to actions/checkout@<full-commit-sha> # v5 and IBM/actions-ibmcloud-cli@v1 to IBM/actions-ibmcloud-cli@<full-commit-sha> # v1, where the full-commit-sha is the complete 40-character commit hash for each action. Include the version tag as a comment after each SHA for readability and maintainability.
Source: Linters/SAST tools
.github/workflows/hot-cluster-e2e.yml-53-56 (1)
53-56: ⚠️ Potential issue | 🟠 Major
Pin workflow actions to commit SHAs instead of floating tags.
actions/checkout@v6 and IBM/actions-ibmcloud-cli@v1 use mutable floating tags that can be retargeted by maintainers, introducing supply chain attack risk. Pin to full-length (40-character) commit SHAs, with version tags in comments for readability:
    uses: actions/checkout@<full-commit-sha> # v6
    uses: IBM/actions-ibmcloud-cli@<full-commit-sha> # v1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/hot-cluster-e2e.yml around lines 53 - 56, Replace the mutable floating version tags in both the `actions/checkout` and `IBM/actions-ibmcloud-cli` action references with their full-length commit SHAs to prevent supply chain attacks. For each action, change from the format `uses: <action>@v<version>` to `uses: <action>@<commit-sha>` and add a comment with the version tag for readability using the format `# v<version>`. Look up the current commit SHAs for checkout@v6 and ibmcloud-cli@v1 from their respective repositories and update both action references accordingly.
Source: Linters/SAST tools
.github/workflows/ibmc-cluster-setup.yml-123-138 (1)
123-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Only pass VLAN flags when values are present.
The script says “new VLANs will be created” when none exist, but still passes empty --private-vlan/--public-vlan values, which can fail cluster creation.
Suggested fix
    - ibmcloud oc cluster create classic \
    + VLAN_ARGS=()
    + [[ -n "${PRIVATE_VLAN}" ]] && VLAN_ARGS+=(--private-vlan "${PRIVATE_VLAN}")
    + [[ -n "${PUBLIC_VLAN}" ]] && VLAN_ARGS+=(--public-vlan "${PUBLIC_VLAN}")
    +
    + ibmcloud oc cluster create classic \
        --name "${CLUSTER_NAME}" \
        --version "${{ inputs.openshift_version }}" \
        --flavor "${{ inputs.worker_flavor }}" \
        --workers "${{ inputs.worker_count }}" \
    -   --zone "${ZONE}" \
    -   --private-vlan "${PRIVATE_VLAN}" \
    -   --public-vlan "${PUBLIC_VLAN}"
    +   --zone "${ZONE}" \
    +   "${VLAN_ARGS[@]}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ibmc-cluster-setup.yml around lines 123 - 138, The ibmcloud oc cluster create classic command always passes the --private-vlan and --public-vlan flags even when PRIVATE_VLAN and PUBLIC_VLAN are empty, which prevents IBM Cloud from automatically creating new VLANs. Conditionally append these flags to the ibmcloud command only when the PRIVATE_VLAN and PUBLIC_VLAN variables are non-empty, similar to the conditional logic already present in the if statement that checks for existing VLANs, so that new VLANs are created automatically when neither variable is set.
.github/workflows/ibmc-cluster-teardown.yml-61-61 (1)
61-61: ⚠️ Potential issue | 🟠 Major
Replace `curl | bash` with official `azure/setup-helm` action.
The current implementation downloads and executes unverified remote code at runtime. Use the official `azure/setup-helm` action (e.g., `azure/setup-helm@v5`) instead, which provides secure, version-pinned installation with no script execution. This eliminates the remote code execution risk while improving reliability through built-in caching and proper environment configuration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/ibmc-cluster-teardown.yml at line 61, Replace the inline curl piped to bash pattern that downloads and executes the helm installation script from the remote GitHub repository with the official azure/setup-helm action version 5 or later. Remove the condition checking for the helm command and the curl command piping to bash, and instead use the azure/setup-helm@v5 action as a dedicated workflow step which provides secure, version-pinned installation without remote script execution.
.github/workflows/ibmc-cluster-teardown.yml-37-37 (1)
37-37: ⚠️ Potential issue | 🟠 Major
Pin `IBM/actions-ibmcloud-cli` to a commit SHA instead of `@v1`.
Version tags are mutable and allow the action to be updated without your knowledge, creating supply-chain risk. Use the full commit SHA with the version tag as a comment for readability, following the pattern already established in this repository (e.g., `IBM/actions-ibmcloud-cli@<commit-sha> # v1`).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/ibmc-cluster-teardown.yml at line 37, The IBM/actions-ibmcloud-cli action on line 37 uses a mutable version tag (`@v1`) which creates supply-chain risk. Replace the `@v1` tag with the full commit SHA for that version, and add a comment with the version tag for readability in the format: IBM/actions-ibmcloud-cli@<full-commit-sha> # v1. Look for similar patterns already established elsewhere in the repository workflows to determine the correct commit SHA to use for v1.
Source: Linters/SAST tools
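The "look up the commit SHA for a tag" step in the pinning findings above can be sketched in shell. This is a minimal sketch, not the repository's tooling: `pick_tag_commit` is a hypothetical helper name, and it assumes its input is the output of `git ls-remote --tags <repo-url>`. For annotated tags that output contains a peeled `^{}` entry, which is the commit to pin, so the helper prefers it over the tag object's own hash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the commit SHA for a tag from
# `git ls-remote --tags` output supplied on stdin.
# Input lines look like "<sha>\trefs/tags/<tag>" and, for annotated
# tags, an extra peeled line "<sha>\trefs/tags/<tag>^{}".
pick_tag_commit() {
  local tag="$1"
  awk -v ref="refs/tags/${tag}" '
    $2 == (ref "^{}") { peeled = $1 }   # commit behind an annotated tag
    $2 == ref         { direct = $1 }   # lightweight tag, or the tag object
    END {
      if (peeled != "")      print peeled
      else if (direct != "") print direct
      else                   exit 1      # tag not found
    }'
}

# Example use (requires network access; <repo-url> is a placeholder):
#   git ls-remote --tags <repo-url> | pick_tag_commit v1
```

The printed hash is what goes after the `@` in `uses:`, with the tag kept as a trailing comment as the finding describes. Matching the ref name exactly avoids picking up `v10` when `v1` was requested.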
ci-scripts/helm/ci-env-controller/templates/clusterrole-console.yaml-2-2 (1)
2-2: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Current binding model gives the console SA cluster-wide write access.
Line 2 documents this role is bound via ClusterRoleBinding. Combined with Line 61–73 mutating verbs (`create`/`update`/`patch`/`delete`), that grants write capability across all namespaces, not only the CI test namespace.
A safer pattern is:
- keep cluster-scoped read permissions in this ClusterRole, and
- move mutating namespaced permissions to a namespaced `Role` bound with `RoleBinding` per test namespace.
Also applies to: 61-73
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/helm/ci-env-controller/templates/clusterrole-console.yaml` at line 2, The ClusterRole for the console ServiceAccount currently grants cluster-wide write access through mutating verbs (create, update, patch, delete) on namespaced resources. Refactor this by removing the mutating verb definitions from lines 61-73 in the ClusterRole, keeping only the read-only permissions (get, list, watch). Create a separate namespaced Role with the mutating verbs (create, update, patch, delete) that will be bound per CI test namespace using a RoleBinding instead of ClusterRoleBinding. This limits write access to only the specific test namespace while maintaining cluster-wide read access where needed.
ci-scripts/helm/ci-env-controller/templates/clusterrole-controller.yaml-11-13 (1)
11-13: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Remove mutating permissions from `nodes` access.
Line 11–13 combines `nodes` and `namespaces` with create/update/patch/delete verbs. For cluster discovery, `nodes` should be read-only; keeping write verbs here over-privileges the controller.
🔧 Suggested RBAC split
-  - apiGroups: ['']
-    resources: ['nodes', 'namespaces']
-    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch', 'delete']
+  - apiGroups: ['']
+    resources: ['nodes']
+    verbs: ['get', 'list', 'watch']
+  - apiGroups: ['']
+    resources: ['namespaces']
+    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch', 'delete']
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/helm/ci-env-controller/templates/clusterrole-controller.yaml` around lines 11 - 13, The ClusterRole rule is granting unnecessary write permissions (create, update, patch, delete) to the nodes resource. Split this rule into two separate rules: one for nodes with only read-only verbs (get, list, watch), and another for namespaces with the appropriate verbs. This ensures the controller can only read node information for cluster discovery without being able to modify nodes.
ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh-82-109 (1)
82-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Fail provisioning when proxy route is unavailable.
Route creation errors are currently logged and ignored, then provisioning continues with a potentially invalid `PLUGIN_PROXY_ENDPOINT`. This can mark environments as provisioned while core traffic routing is broken.
Suggested fix
 ensure_proxy_route() {
@@
-  if oc get route "${route_name}" -n "${route_ns}" &>/dev/null; then
+  if oc get route "${route_name}" -n "${route_ns}" &>/dev/null; then
     log "Proxy route already exists in ${route_ns}"
   else
     log "Creating proxy route ${route_name} in ${route_ns}..."
-    cat <<EOF | oc create -f - 2>/dev/null || log "Proxy route create skipped (may already exist or namespace missing)"
+    cat <<EOF | oc create -f - 2>/dev/null || true
 apiVersion: route.openshift.io/v1
@@
 EOF
   fi
+
+  if ! oc get route "${route_name}" -n "${route_ns}" &>/dev/null; then
+    log "ERROR: proxy route ${route_name} is unavailable in ${route_ns}"
+    return 1
+  fi
   PLUGIN_PROXY_ENDPOINT="https://${proxy_host}"
 }
@@
-  ensure_proxy_route
+  ensure_proxy_route || {
+    patch_cm "${cm_name}" '{"data":{"status":"error","error-message":"proxy route unavailable"}}'
+    return 1
+  }
Also applies to: 145-146
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh` around lines 82 - 109, The proxy route creation failure is silently logged and ignored, allowing the script to continue setting PLUGIN_PROXY_ENDPOINT with a potentially invalid value. After the oc create command in the route creation block, add a verification check using oc get route to confirm the route was actually created successfully. If the verification fails, exit the script with an error instead of continuing provisioning. Only set the PLUGIN_PROXY_ENDPOINT variable after confirming the route exists and is accessible.
ci-scripts/helm/ci-env-controller/templates/role-trigger.yaml-11-14 (1)
11-14: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Scope runner ConfigMap permissions away from controller-owned objects.
This Role grants the runner SA full CRUD on every ConfigMap in the controller namespace. A compromised or buggy runner can modify/delete `ci-env-controller-script` and other non-trigger ConfigMaps, causing controller outage or tampering.
Suggested direction
- Isolate trigger ConfigMaps into a dedicated namespace and bind this Role there.
- Keep controller operational ConfigMaps (script/config) in a separate namespace not writable by runner SAs.
- If namespace split is deferred, at minimum remove nonessential verbs (especially `delete`) to reduce immediate blast radius.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/helm/ci-env-controller/templates/role-trigger.yaml` around lines 11 - 14, The Role in role-trigger.yaml grants excessive permissions that allow the runner SA to modify or delete all ConfigMaps in the namespace, including critical controller ConfigMaps like ci-env-controller-script. Restrict the RBAC permissions by either implementing namespace segregation (isolate trigger ConfigMaps to a dedicated namespace and keep controller ConfigMaps in a separate namespace) or at minimum reduce the permission scope by removing the delete verb and using resourceNames (Kubernetes RBAC rules cannot filter by label selector) to limit access to only trigger-related ConfigMaps that the runner actually needs to interact with.
ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh-174-176 (1)
174-176: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add timeout bounds to readiness polling curl.
The loop uses `curl` without `--connect-timeout`/`--max-time`; a single hung request can block reconciliation indefinitely.
Suggested fix
-    if curl -s -o /dev/null -w "%{http_code}" "${bridge_base}/" 2>/dev/null | grep -qE '200|301|302'; then
+    if curl -sS --connect-timeout 2 --max-time 5 \
+      -o /dev/null -w "%{http_code}" "${bridge_base}/" 2>/dev/null | grep -qE '200|301|302'; then
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh` around lines 174 - 176, The curl command in the readiness polling loop that checks the bridge_base URL lacks timeout bounds, which can cause the loop to hang indefinitely if a request stalls. Add the --connect-timeout and --max-time flags to the curl command that checks "${bridge_base}/" to ensure that individual requests have reasonable time limits and cannot block reconciliation. These timeout values should be reasonable for your use case, typically in the range of 5-10 seconds each.
ci-scripts/helm/ci-test-stack/templates/console-sa.yaml-9-20 (1)
9-20: ⚠️ Potential issue | 🟠 Major
Avoid creating a legacy long-lived service-account token Secret.
Pre-creating a `kubernetes.io/service-account-token` Secret introduces a persistent credential object and weakens token lifecycle security. The token is currently injected into the console Deployment as an environment variable (`BRIDGE_K8S_AUTH_BEARER_TOKEN`). Prefer projected bound tokens (using `volumes` with `serviceAccountToken` projection and `volumeMounts`) or explicit TokenRequest at runtime, though this would require the console application to read the token from a mounted file rather than an environment variable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/helm/ci-test-stack/templates/console-sa.yaml` around lines 9 - 20, The legacy kubernetes.io/service-account-token Secret resource in the console-sa.yaml template creates a persistent credential and weakens security. Remove the entire Secret resource definition that uses type kubernetes.io/service-account-token. Instead, update the console Deployment to use projected bound tokens by adding a volumes section with a serviceAccountToken projection and corresponding volumeMounts, removing the BRIDGE_K8S_AUTH_BEARER_TOKEN environment variable injection. This requires the console application to read the token from the mounted file path instead of from an environment variable.
ci-scripts/helm/ci-test-stack/templates/_helpers.tpl-29-29 (1)
29-29: ⚠️ Potential issue | 🟠 Major
Remove hardcoded `cluster.local` from the plugin service URL.
Line 29 hardcodes `cluster.local` in the `pluginUrl` helper, which fails on clusters using a non-default service DNS domain. Use just `.svc` instead, allowing Kubernetes DNS search path resolution.
Suggested change
- http://{{ include "ci-test-stack.pluginName" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.plugin.port }}
+ http://{{ include "ci-test-stack.pluginName" . }}.{{ .Release.Namespace }}.svc:{{ .Values.plugin.port }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/helm/ci-test-stack/templates/_helpers.tpl` at line 29, The pluginUrl helper template in the _helpers.tpl file hardcodes the Kubernetes DNS domain as cluster.local in the plugin service URL, which breaks on clusters using non-default service DNS domains. Modify the URL construction to remove the hardcoded cluster.local domain and use only .svc, allowing Kubernetes DNS search path resolution to automatically discover the correct domain based on the cluster configuration.
ci-scripts/helm/ci-test-stack/templates/console-deployment.yaml-44-50 (1)
44-50: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add `BRIDGE_K8S_AUTH=bearer-token` to match the bearer-token flow.
Line 46 injects a bearer token, but the deployment never sets `BRIDGE_K8S_AUTH`. That creates a config mismatch versus the expected off-cluster auth contract and can cause console API auth failures.
Suggested fix
             - name: BRIDGE_USER_AUTH
               value: "disabled"
+            - name: BRIDGE_K8S_AUTH
+              value: "bearer-token"
             - name: BRIDGE_K8S_AUTH_BEARER_TOKEN
               valueFrom:
                 secretKeyRef:
                   name: {{ include "ci-test-stack.consoleName" . }}-token
                   key: token
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/helm/ci-test-stack/templates/console-deployment.yaml` around lines 44 - 50, The deployment injects a bearer token via BRIDGE_K8S_AUTH_BEARER_TOKEN but does not set the BRIDGE_K8S_AUTH environment variable to specify the authentication mechanism. Add a new environment variable entry for BRIDGE_K8S_AUTH with value "bearer-token" in the env section of the console-deployment template, positioned alongside the other auth-related variables like BRIDGE_USER_AUTH and BRIDGE_K8S_AUTH_BEARER_TOKEN, to properly configure the off-cluster authentication contract and prevent console API auth failures.
start-console.sh-99-100 (1)
99-100: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Avoid `eval` when consuming `resolve-console-image.sh` output.
Line 99 executes arbitrary shell text if the helper output is ever malformed or tainted. Parse `CONSOLE_IMAGE=` explicitly instead of evaluating.
Suggested fix
-eval "$(bash ./ci-scripts/resolve-console-image.sh)" || true
+resolved_console_image="$(bash ./ci-scripts/resolve-console-image.sh 2>/dev/null || true)"
+case "${resolved_console_image}" in
+  CONSOLE_IMAGE=*) CONSOLE_IMAGE="${resolved_console_image#CONSOLE_IMAGE=}" ;;
+esac
 CONSOLE_IMAGE="${CONSOLE_IMAGE:-quay.io/openshift/origin-console:latest}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@start-console.sh` around lines 99 - 100, The eval statement on line 99 poses a security risk by executing arbitrary shell code if the output from resolve-console-image.sh is malformed or tainted. Replace the eval approach by capturing the output of the resolve-console-image.sh script into a variable and then explicitly extracting the CONSOLE_IMAGE value from that output using grep or parameter expansion, rather than evaluating it as shell code. This ensures only the CONSOLE_IMAGE assignment is processed while safely handling any potentially malicious or malformed output.
ci-scripts/install-hco.sh-141-145 (1)
141-145: ⚠️ Potential issue | 🟠 Major
Pin HPP manifest source to an immutable git ref (tag/SHA).
The configured default `release-v0.21` is a mutable branch reference, making installs non-reproducible and vulnerable to upstream drift between runs. Use a specific tag or commit SHA instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/install-hco.sh` around lines 141 - 145, The HPP_VERSION variable is currently set to a mutable branch reference (release-v0.21) which causes non-reproducible installations and vulnerability to upstream drift. Change the HPP_VERSION variable to use an immutable git reference instead, such as a specific semantic version tag (like v0.21.0) or a commit SHA. This will ensure that both the hostpathprovisioner_cr.yaml and storageclass-wffc-csi.yaml manifests are fetched from a stable, unchanging source across multiple installation runs.
ci-scripts/install-hco.sh-147-149 (1)
147-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Default StorageClass mutation can leave the cluster without any default.
Line 147 clears defaults globally before confirming `hostpath-csi` is present/writable. A failure on Line 148 can break subsequent namespace provisioning on the shared cluster.
Suggested fix
-  oc annotate storageclasses --all storageclass.kubernetes.io/is-default-class- || true
-  oc annotate storageclass hostpath-csi storageclass.kubernetes.io/is-default-class='true'
+  if ! oc get storageclass hostpath-csi >/dev/null 2>&1; then
+    echo "ERROR: hostpath-csi StorageClass not found; refusing to clear existing defaults"
+    exit 1
+  fi
+  oc annotate storageclass hostpath-csi storageclass.kubernetes.io/is-default-class='true' --overwrite
+  oc get storageclass -o name | grep -v '^storageclass.storage.k8s.io/hostpath-csi$' | \
+    xargs -r oc annotate storageclass storageclass.kubernetes.io/is-default-class- || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/install-hco.sh` around lines 147 - 149, The issue is that line 147 removes all default StorageClass annotations globally, and if the subsequent command on line 148 (which sets hostpath-csi as the default) fails, the cluster is left without any default StorageClass, breaking namespace provisioning. Before removing all default annotations on line 147, first verify that the hostpath-csi StorageClass exists and is accessible. Reorder the commands so that hostpath-csi availability is confirmed and set as default before clearing the global defaults, or add error handling to ensure a default StorageClass is always maintained even if the hostpath-csi annotation command fails.
ci-scripts/start-console.sh-94-99 (1)
94-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
`CONSOLE_IMAGE` fallback is missing after resolver failure.
Line 96 swallows resolver errors, but Line 98 immediately hard-fails if `CONSOLE_IMAGE` is unset. That contradicts the fallback behavior documented in the block comment.
Suggested fix
 eval "$(bash "${SCRIPT_DIR}/resolve-console-image.sh")" || true
+CONSOLE_IMAGE="${CONSOLE_IMAGE:-quay.io/openshift/origin-console:latest}"
 CONSOLE_PORT=${CONSOLE_PORT:-9000}
-: "${CONSOLE_IMAGE:?CONSOLE_IMAGE must be set or resolved before starting the console}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/start-console.sh` around lines 94 - 99, The resolve-console-image.sh script call suppresses errors with || true, but there is no actual fallback value set for CONSOLE_IMAGE when that script fails, causing the parameter expansion check on line 98 to hard-fail. This contradicts the documented fallback behavior in the block comment that mentions falling back to :latest on failure. Add a fallback assignment for CONSOLE_IMAGE (such as setting it to a base image name with the :latest tag) that will be used when the resolver script does not successfully set the variable, ensuring the fallback behavior matches the documented intent.
ci-scripts/check-cluster-health.sh-88-97 (1)
88-97: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
ARC listener check can false-pass on unrelated running pods.
Lines 89-92 count all Running pods in `ARC_RUNNERS_NS`; any unrelated pod can satisfy this check even if the listener is unhealthy, weakening the E2E gate.
Suggested fix
 check "ARC listener pod in ${ARC_RUNNERS_NS}" bash -c "
-  running=\$(oc get pods -n '${ARC_RUNNERS_NS}' --no-headers 2>/dev/null | grep -c 'Running')
+  running=\$(oc get pods -n '${ARC_RUNNERS_NS}' --no-headers 2>/dev/null | awk '\$1 ~ /listener/ && \$3 == \"Running\" {c++} END {print c+0}')
   if [[ \"\${running}\" -ge 1 ]]; then
     echo \" \${running} Running pod(s) (listener/controller)\"
     exit 0
   else
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/check-cluster-health.sh` around lines 88 - 97, The ARC listener pod check in the check function is counting all running pods in the ARC_RUNNERS_NS namespace, which can false-pass if any unrelated pod is running even when the listener pod itself is down. Replace the generic grep pattern that counts all "Running" pods with a more specific filter that only matches the actual ARC listener or controller pod by name pattern (such as grepping for a pod name prefix like "arc-runner-set" or the actual listener pod naming convention). This ensures the check specifically validates the listener pod is running, not just any pod in that namespace.
ci-scripts/resolve-console-image.sh-15-18 (1)
15-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Preserve `:latest` fallback when ClusterVersion is unavailable.
Line 15-18 hard-fails, but the controller-side resolver uses `CONSOLE_IMAGE_REGISTRY:latest` fallback. This contract mismatch can fail console startup during transient cluster-version lookup errors.
Suggested fix
 if [[ -z "${VERSION}" ]]; then
-  echo "::error::Could not read .status.desired.version from ClusterVersion 'version'. Is this an OpenShift cluster and is oc authenticated?" >&2
-  exit 1
+  CONSOLE_IMAGE="${CONSOLE_IMAGE_REGISTRY}:latest"
+  echo "::warning::Could not read .status.desired.version from ClusterVersion 'version'. Falling back to ${CONSOLE_IMAGE}." >&2
+  echo "CONSOLE_IMAGE=${CONSOLE_IMAGE}"
+  exit 0
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/resolve-console-image.sh` around lines 15 - 18, The script hard-fails when the VERSION variable cannot be read from ClusterVersion, but the controller-side resolver has a fallback mechanism to use CONSOLE_IMAGE_REGISTRY:latest. Instead of exiting with an error in the if block that checks for empty VERSION, modify the script to allow graceful fallback behavior. Either set VERSION to a default value (like latest) when the ClusterVersion lookup fails, or remove the hard exit and allow the script to continue with a fallback, ensuring the script's behavior is consistent with the controller's contract of using :latest as a fallback during transient cluster-version lookup errors.
| kubevirt-plugin-image: ${{ env.KUBEVIRT_PLUGIN_IMAGE }} | ||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@v6 |
🧩 Analysis chain
🏁 Script executed:
cat -n .github/workflows/hot-cluster-e2e-run.yml | sed -n '170,180p; 185,195p; 230,240p; 290,300p; 328,338p'
Repository: kubevirt-ui/kubevirt-plugin
Length of output: 2332
🌐 Web query:
GitHub Actions security best practice commit SHA pinning vs version tags supply chain
💡 Result:
Pinning GitHub Actions to a full-length commit SHA is a primary security best practice to prevent supply chain attacks, as it ensures the action references an immutable version of the code [1][2][3].
Comparison of Pinning Methods
- Commit SHA Pinning (Recommended): Provides an immutable reference. Even if an attacker gains access to the action's repository and updates an existing tag (a "tag retargeting" attack), a workflow pinned to a specific SHA will continue to execute the original, trusted code [3][4][5].
- Version Tags (e.g., @v1, @v1.2): These are mutable references. If an action's maintainer account is compromised, attackers can move or delete these tags to point to malicious code, which will then be automatically used by any workflow referencing those tags [1][4][6]. While more convenient, tags should only be used if you fully trust the creator and are aware of the inherent risks [1][7].
Supply Chain Security Nuance
While SHA pinning protects against tag manipulation, it is not a complete solution:
- Immutable but not invulnerable: Pinning only ensures the referenced commit remains unchanged. If the action code itself is malicious (e.g., an attacker submits a malicious PR that is merged and then pinned), pinning will not prevent the execution of that code [4].
- Runtime behaviors: Some actions, even when pinned to a SHA, may fetch additional dependencies or data at runtime (e.g., via curl, npm, or docker pull) that are unversioned or dynamic (like latest), effectively bypassing the protection of the initial SHA pin [5][8].
- Maintenance: Pinned SHAs require manual or automated updates to receive bug fixes and security patches. Tools such as Renovate, pinact, ghavm, or pinprick are commonly used to automate this maintenance by keeping pinned SHAs up to date with the latest releases [4][5][8][9][10][6].
Policy and Governance
GitHub supports enforcing SHA pinning at the repository and organization level [1][2][5]. When enabled, this policy causes any workflow that attempts to reference an action without a full-length commit SHA to fail, ensuring consistent adherence to security standards across an entire organization [2][5][6].
Best Practice Recommendation
To maximize security:
1. Pin all third-party actions to full-length commit SHAs [5][6].
2. Maintain human-readable context by including the original version tag as a comment in the workflow file (e.g., uses: actions/checkout@sha # v4.0.0) [11][6].
3. Automate the updates of these pins using tools like Renovate or Dependabot to ensure you continue to receive security updates [4][6].
4. Audit actions for "runtime fetch" patterns that might bypass static pinning [5][8].
Citations:
- 1: https://docs.github.com/en/actions/reference/security/secure-use
- 2: https://github.blog/changelog/2025-08-15-github-actions-policy-now-supports-blocking-and-sha-pinning-actions/
- 3: https://www.stepsecurity.io/blog/pinning-github-actions-for-enhanced-security-a-complete-guide
- 4: https://emmer.dev/blog/pin-your-github-actions-to-protect-against-mutability/
- 5: https://mikael.barbero.tech/blog/post/2026-03-24-stop-trusting-mutable-references/
- 6: https://architecture.lullabot.com/adr/20260505-pin-github-actions-to-hash/
- 7: https://github.com/github/docs/blob/962a1c8dccb8c0f66548b324e5b921b5e4fbc3d6/content/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions.md
- 8: https://github.com/starhaven-io/pinprick
- 9: https://github.com/mccutchen/ghavm
- 10: https://github.com/suzuki-shunsuke/pinact
- 11: https://github.com/Hyland/github-actions-ensure-sha-pinned-actions
Pin all external actions to full-length commit SHAs to prevent supply chain attacks.
actions/checkout@v6, docker/metadata-action@v6, and actions/upload-artifact@v6 use mutable version tags instead of immutable commit SHAs. If an action maintainer's account is compromised, attackers can retarget version tags to point to malicious code, automatically executing the compromised version in all workflows using those tags. Replace with commit SHAs (e.g., actions/checkout@abc1234def567890), optionally with the version tag as a comment for clarity. Consider automating updates using Renovate or Dependabot.
Applies to lines: 176, 190, 236, 297, 333
🧰 Tools
🪛 zizmor (1.25.2)
[warning] 175-176: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
[error] 176-176: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/hot-cluster-e2e-run.yml at line 176, Replace all mutable
action version tags with their corresponding full-length commit SHAs to prevent
supply chain attacks. In the workflow file, locate all instances of
`actions/checkout@v6`, `docker/metadata-action@v6`, and
`actions/upload-artifact@v6` (found at lines 176, 190, 236, 297, and 333) and
replace the `@v6` or version tag portion with the full commit SHA for each
action. You can optionally include the version tag as a comment for clarity
(e.g., `actions/checkout@abc1234def567890 # v6`). Verify each commit SHA is
correct by checking the official repository for each action.
Source: Linters/SAST tools
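A quick local check for the `unpinned-uses` finding can be sketched as follows. This is an illustration under stated assumptions, not how zizmor or CodeRabbit detect the issue: `find_unpinned_uses` is a hypothetical function name, and the check is a plain text match that treats any `uses:` reference not ending in `@` plus 40 hexadecimal characters as unpinned, while skipping local (`./`) and `docker://` references.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "line:content" for every `uses:` entry in a
# workflow file that is not pinned to a full 40-character commit SHA.
find_unpinned_uses() {
  local file="$1"
  # 1) keep only `uses:` lines (optionally starting a list item with "- ")
  # 2) drop local actions and docker:// image references
  # 3) drop references already pinned to a 40-hex SHA (a trailing
  #    "# vX" comment after the SHA is allowed)
  grep -nE '^[[:space:]]*(-[[:space:]]+)?uses:[[:space:]]*' "$file" \
    | grep -vE 'uses:[[:space:]]*(\./|docker://)' \
    | grep -vE '@[0-9a-f]{40}[[:space:]]*(#.*)?$' \
    || true   # no matches is not an error
}

# Example use:
#   find_unpinned_uses .github/workflows/hot-cluster-e2e-run.yml
```

Because it is only a pattern match, it does not verify that a SHA actually belongs to the named action; it just flags the references a reviewer still needs to pin.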
| # | ||
| # Start the "off cluster" console. Based on the `route-console.sh` and `start-console.sh` scripts. | ||
| # | ||
| set -euox pipefail |
set -x exposes the bearer token before masking.
Line 5 enables xtrace, so the token assignment at Line 81 is logged before Line 86 applies masking. This leaks cluster credentials to CI logs.
Suggested fix
-set -euox pipefail
+set -euo pipefail
+# Optional debug:
+# [[ "${DEBUG:-false}" == "true" ]] && set -x
Also applies to: 81-87, 134-149
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ci-scripts/start-console.sh` at line 5, The xtrace flag enabled by the `-x`
option in the set command on line 5 causes all commands and their arguments to
be printed to output before execution. This exposes the bearer token assignment
at line 81 to CI logs before the masking logic at line 86 can hide it. Remove
the `-x` flag from the set command so xtrace is disabled and the token
assignment will not be logged to the output before masking is applied.
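If debug tracing is kept behind a flag, as the suggested fix allows, the token can still be read with xtrace suspended. The following is a minimal sketch under assumptions: `read_secret_quietly` is a hypothetical helper, not part of `start-console.sh`, and the `oc whoami -t` call in the usage comment is only an example of a command that prints a token. The `::add-mask::` line is the GitHub Actions workflow command for masking a value in later log output, and it is emitted only when `GITHUB_ACTIONS` is `true`, so that the value is not printed in other environments.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command that prints a secret, store its
# stdout in the named variable, and keep the value out of xtrace output.
# Usage: read_secret_quietly VAR_NAME command [args...]
read_secret_quietly() {
  local __had_xtrace=0
  case "$-" in *x*) __had_xtrace=1 ;; esac   # remember the caller's setting
  set +x                                     # nothing below is traced
  local __var="$1"; shift
  local __value
  if ! __value="$("$@")"; then
    [ "$__had_xtrace" -eq 1 ] && set -x
    return 1
  fi
  printf -v "$__var" '%s' "$__value"         # assign without eval
  if [ "${GITHUB_ACTIONS:-}" = "true" ]; then
    echo "::add-mask::${__value}"            # mask in subsequent log lines
  fi
  [ "$__had_xtrace" -eq 1 ] && set -x        # restore tracing if it was on
  return 0
}

# Example use (assumes an authenticated oc session):
#   read_secret_quietly BRIDGE_TOKEN oc whoami -t
```

Later commands that expand the variable while `set -x` is active would still be traced with the value in them, so removing `-x` by default, as the finding suggests, remains the simpler safeguard.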
Manual-only setup/teardown workflows plus ci-scripts so cluster provisioning can run from main before the full hot-cluster PR (kubevirt-ui#4099). Does not change existing CI triggers or behavior.
Establish a basic GitHub action based CI workflow for running e2e tests on a hot-cluster. POC workflows include setting up IBM Cloud ROKS clusters. Full documentation is in the `POC_HOT_CLUSTER_CI.md` and referenced files. The hot-cluster can be any OpenShift cluster (FIPS and non-FIPS) and is based on the GitHub Actions runner controller (ARC). It is installed via helm, and only requires network access to pull from GitHub within the cluster itself. The POC can even be run on a local CRC / OpenShift local development cluster without any special networking configuration. The self-hosted runners are ephemeral and are deleted after the CI job is complete. Full setup documentation is in the `ci-scripts/README.md` file. Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
`ci-env` controller:
- manages CI test environments using ConfigMaps
- GitHub Actions to request and release the CI test environment via ConfigMaps
`ci-test-stack`:
- Helm chart to configure and deploy the console and plugin
- console and plugin are deployed into the test environment namespace
- access from the workflow is via the service endpoints
- external route is available for debugging
The test2 workflow is updated to use these GitHub Actions.
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
…e controller sh Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Correct "POC Hot ClusterE2E CI Test" to "POC Hot Cluster E2E CI Test" (missing space before E2E) in both the workflow-to-name table and the Variant A usage steps. Matches the actual name: field in .github/workflows/poc-e2e-ci-test.yml. Made-with: Cursor
Remove the sequential dependency between check-runner and build-kubevirt-plugin-image. The diagnostics job produces no outputs consumed by the build, so there is no functional coupling. Gate run-gating-tests on both jobs so it still waits for runner validation before provisioning the test environment. This saves the full duration of check-runner (~1-2 min) from the critical path on every run. Made-with: Cursor
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
- Remove duplicate poc-e2e-ci-test.yml; keep hot-cluster-e2e.yml as entry
- Rename poc-e2e-ci-test2.yml to hot-cluster-e2e-run.yml
- Delete POC_HOT_CLUSTER_CI.md; move backlog to docs/HOT_CLUSTER_FUTURE_WORK.md
- Add docs/HOT_CLUSTER_CI_CONTINUATION.md and docs/CLUSTER_LIFECYCLE.md
- Use IC_KEY secret, secure oc install, Playwright E2E updates
- Update ci-scripts/README and auto-teardown workflow references
Instead of failing when cos_instance_crn is empty, look for an existing COS instance named <cluster_name>-cos, or create one automatically. This makes the VPC path work out-of-the-box without pre-setup. Co-authored-by: Cursor <cursoragent@cursor.com>
ibmcloud resource service-instance-create does not support --output json. Create first, then query the CRN with service-instances --output json. Co-authored-by: Cursor <cursoragent@cursor.com>
The 'standard' plan no longer works; use 'premium-global-deployment' which is the first option shown by the CLI. Co-authored-by: Cursor <cursoragent@cursor.com>
Use the plan ID (744bfc56-d12c-4866-88d5-dac9139e0e5d) instead of the plan name to avoid interactive prompts or name resolution issues. Co-authored-by: Cursor <cursoragent@cursor.com>
The standard plan ID resolves to two deployments (premium-global-deployment, premium-global-deployment-iam). Use -d premium-global-deployment to select non-interactively. Co-authored-by: Cursor <cursoragent@cursor.com>
Probes VPC, COS, DNS Services, CIS, IAM Identity, resource groups, and authorization policies to confirm readiness for OpenShift IPI on IBM Cloud VPC. Run with INFRASTRUCTURE_TYPE=ipi. Co-authored-by: Cursor <cursoragent@cursor.com>
Select infrastructure_type=ipi to run IPI prerequisite checks (VPC, COS, DNS, IAM Identity, resource groups, auth policies) without creating any resources. All common steps are skipped. Co-authored-by: Cursor <cursoragent@cursor.com>
When infrastructure_type=ipi, downloads openshift-install 4.20.8, generates install-config.yaml for IBM Cloud VPC, and attempts cluster creation. Includes automatic cleanup on failure and artifact upload for debugging. Expects OPENSHIFT_PULL_SECRET secret. Co-authored-by: Cursor <cursoragent@cursor.com>
openshift-install uses hyphen-separated VPC instance profiles (bx2-4x16) while ibmcloud oc uses dot-separated (bx2.4x16). Convert automatically. Co-authored-by: Cursor <cursoragent@cursor.com>
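The conversion described in this commit can be sketched as a small shell helper. This is only an illustration: the function name `to_installer_profile` is hypothetical, and the PR's actual implementation is not shown in this thread.

```bash
# Hypothetical helper (name not taken from the PR).
# Converts an `ibmcloud oc` flavor such as bx2.4x16 into the
# hyphen-separated spelling openshift-install expects (bx2-4x16).
to_installer_profile() {
  # Replace every dot with a hyphen; input without dots passes through unchanged.
  printf '%s\n' "$1" | tr '.' '-'
}

to_installer_profile "bx2.4x16"   # prints bx2-4x16
```

Because input without dots is returned unchanged, the helper is safe to apply to a value that is already in the installer's format.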
Summarizes the Jun 24 testing session: three provisioning paths (classic, VPC ROKS, IPI), what works, what's blocked, and exact next steps for tomorrow after DNS/COS authorization is configured. Co-authored-by: Cursor <cursoragent@cursor.com>
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Fetch the build log - first try with curl to get a portion of it
curl -s --max-time 30 "https://storage.googleapis.com/test-platform-results/pr-logs/pull/kubevirt-ui_kubevirt-plugin/4099/pull-ci-kubevirt-ui-kubevirt-plugin-main-kubevirt-e2e-aws/2069836692671434752/build-log.txt" | tail -300
Length of output: 15132

The failure is an infrastructure/environment flake, not a code issue introduced by this PR.

Root cause: After successfully provisioning a fresh AWS cluster (~57 min IPI install) and creating the HCO catalog source + subscription, all nine HCO-managed operator deployments timed out simultaneously. This is a well-known OLM/HCO rollout flake in Prow's AWS CI: the operators failed to become available within the wait window on a fresh cluster. The PR's changes are entirely GitHub Actions workflows, Helm charts, and CI scripts; none touch HCO operator deployment logic.

/retest
Domain is registered with IBM Cloud CIS nameservers. IPI will create api.kubevirt-plugin-ci.cnv-ui.com and *.apps.kubevirt-plugin-ci.cnv-ui.com DNS records. Co-authored-by: Cursor <cursoragent@cursor.com>
The IC_KEY identity cannot create resource groups. Tell the installer to use the existing cnv-ui group instead of creating a new one. Co-authored-by: Cursor <cursoragent@cursor.com>
Setup workflow:
- IPI now leaves cluster running on success (cleanup only on failure)
- All bootstrap steps (HCO, ARC, ci-env) run for all infra types
- IPI uses installer kubeconfig instead of ibmcloud oc
- Fixed install-config.yaml heredoc indentation
- Added credentialsMode: Manual + ccoctl manifest generation
- OpenShift version configurable for IPI (stable-4.XX channel)
- Default infrastructure_type changed to ipi

Teardown workflow:
- Added infrastructure_type input (ipi/vpc/classic)
- IPI teardown downloads install state and runs openshift-install destroy

Auto-teardown:
- Detects IPI clusters via DNS probe
- Added business_hours_only option (Sun-Thu 8-18 Israel)
- Idle threshold shown in summary

Cleanup:
- Deleted POC files and status doc
- Consolidated README with IPI docs

Co-authored-by: Cursor <cursoragent@cursor.com>
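The `business_hours_only` window (Sun-Thu 8-18 Israel) could be checked with something like the sketch below. Several details are assumptions rather than facts from the PR: the function name `is_business_hours`, the `Asia/Jerusalem` timezone name, and the choice to treat 18:00 as already outside the window. How the workflow acts on the result is not shown here.

```bash
# is_business_hours DOW HOUR   (hypothetical helper, not from the PR)
#   DOW  - ISO weekday as printed by `date +%u` (1 = Monday ... 7 = Sunday)
#   HOUR - hour of day as printed by `date +%H` (00-23)
# Succeeds for Sunday through Thursday, 08:00 up to but excluding 18:00.
is_business_hours() {
  bh_dow="$1"
  bh_hour="${2#0}"          # normalize "08" to "8"
  bh_hour="${bh_hour:-0}"   # a bare "0" becomes empty above; restore it
  case "$bh_dow" in
    7|1|2|3|4) ;;           # Sunday, Monday, Tuesday, Wednesday, Thursday
    *) return 1 ;;
  esac
  [ "$bh_hour" -ge 8 ] && [ "$bh_hour" -lt 18 ]
}

# Evaluate the current time in the assumed timezone.
if is_business_hours "$(TZ=Asia/Jerusalem date +%u)" "$(TZ=Asia/Jerusalem date +%H)"; then
  echo "inside business hours"
else
  echo "outside business hours"
fi
```

Taking the weekday and hour as arguments, instead of calling `date` inside the function, keeps the window logic testable with fixed inputs.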
Co-authored-by: Cursor <cursoragent@cursor.com>
Move install-config.yaml content to a standalone template file (ci-scripts/ipi-install-config.yaml.tpl) and use envsubst to substitute variables. Fixes the workflow YAML validation error that prevented workflow_dispatch from working. Co-authored-by: Cursor <cursoragent@cursor.com>
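A minimal sketch of the template-plus-`envsubst` approach follows. The template contents and variable names are illustrative only and are not the contents of `ci-scripts/ipi-install-config.yaml.tpl`; the `render_template` helper is likewise hypothetical. Passing an explicit variable list to `envsubst` restricts substitution to those names, so any other `${...}` text in the template is left untouched.

```bash
# Hypothetical helper: render a template, substituting only the listed variables.
#   $1 = template path
#   $2 = variable list in envsubst's SHELL-FORMAT, e.g. '${CLUSTER_NAME} ${BASE_DOMAIN}'
render_template() {
  envsubst "$2" < "$1"
}

if command -v envsubst >/dev/null 2>&1; then
  tpl="$(mktemp)"
  # Illustrative template, not the real install-config template.
  cat > "$tpl" <<'EOF'
metadata:
  name: ${CLUSTER_NAME}
baseDomain: ${BASE_DOMAIN}
# not in the list, so left as-is: ${LEFT_ALONE}
EOF
  export CLUSTER_NAME="kubevirt-plugin-ci" BASE_DOMAIN="cnv-ui.com"
  render_template "$tpl" '${CLUSTER_NAME} ${BASE_DOMAIN}'
  rm -f "$tpl"
else
  echo "envsubst (part of gettext) is required" >&2
fi
```

Keeping the YAML in its own file means the workflow file no longer embeds a heredoc, which is the kind of construct that can trip workflow YAML validation.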
oc adm release extract needs the oc binary which was not available during the CCO manifest step. Download oc, kubectl, and ccoctl alongside openshift-install so credentials can be properly generated. Co-authored-by: Cursor <cursoragent@cursor.com>
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
url="https://storage.googleapis.com/test-platform-results/pr-logs/pull/kubevirt-ui_kubevirt-plugin/4099/pull-ci-kubevirt-ui-kubevirt-plugin-main-kubevirt-e2e-aws/2069892807979438080/build-log.txt"
echo "=== LAST 250 LINES ==="
curl -fsSL --max-time 60 "$url" | tail -250
echo
echo "=== FAILURE KEYWORDS ==="
curl -fsSL --max-time 60 "$url" \
| rg -n -C3 'level=fatal|ERROR|Error:|FAIL|failed|timed out|context deadline exceeded|connection reset|EOF|No such host|Temporary failure|i/o timeout|dial tcp|TLS handshake timeout|panic:' || true
Length of output: 23624
This does not look like an infrastructure or environment flake. The job failed because a Playwright assertion never became true in the test flow for creating an example template from YAML:
The failure reproduced on the retry as well, which points to a real code/test issue rather than a transient cluster problem. There was an earlier temporary timeout while waiting for |
Co-authored-by: Cursor <cursoragent@cursor.com>
Add a pre-step that finds and deletes VMs, load balancers, and DNS records from previous IPI installs with the same cluster name prefix. Also adds a one-off IPI destroy workflow for manual cleanup. Co-authored-by: Cursor <cursoragent@cursor.com>
…ter name The previous cleanup deleted all VMs/LBs matching the cluster name prefix, which would prevent parallel clusters. Now only cleans stale DNS records (the actual blocker for re-installs) and scopes to the exact cluster name. VMs and LBs are managed by openshift-install destroy, not brute-force. Co-authored-by: Cursor <cursoragent@cursor.com>
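The difference between prefix matching and exact-name scoping can be sketched as below. The helper name `is_cluster_record` is hypothetical, and the two record shapes are taken from the earlier commit note about `api.<cluster>.<domain>` and `*.apps.<cluster>.<domain>`; the PR's real cleanup step may match records differently.

```bash
# is_cluster_record RECORD CLUSTER BASE_DOMAIN   (hypothetical helper)
# Succeeds only for the DNS records belonging to this exact cluster name,
# so a sibling cluster that merely shares the prefix is never matched.
is_cluster_record() {
  case "$1" in
    # Quoted patterns are literal: the "*" here is the wildcard record's
    # actual name, not a shell glob.
    "api.$2.$3"|"*.apps.$2.$3") return 0 ;;
    *) return 1 ;;
  esac
}

for record in \
  "api.kubevirt-plugin-ci.cnv-ui.com" \
  "*.apps.kubevirt-plugin-ci.cnv-ui.com" \
  "api.kubevirt-plugin-ci-2.cnv-ui.com"
do
  if is_cluster_record "$record" "kubevirt-plugin-ci" "cnv-ui.com"; then
    echo "stale candidate: $record"
  else
    echo "keep: $record"
  fi
done
```

With prefix matching, the third record (a parallel cluster named `kubevirt-plugin-ci-2`) would have been deleted as well; exact matching leaves it alone.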
…le it The IC_KEY doesn't have iam.policy.create permission so ccoctl fails. Remove credentialsMode: Manual and the CCO manifests step entirely. Let openshift-install create cluster handle credentials using the API key directly. Also fixes the auth/ dir missing issue (create manifests was consuming install-config before create cluster). Co-authored-by: Cursor <cursoragent@cursor.com>
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
url="https://storage.googleapis.com/test-platform-results/pr-logs/pull/kubevirt-ui_kubevirt-plugin/4099/pull-ci-kubevirt-ui-kubevirt-plugin-main-kubevirt-e2e-aws/2069933323295133696/build-log.txt"
echo "=== LAST 300 LINES ==="
curl -fsSL --max-time 90 "$url" | tail -300
echo
echo "=== FAILURE KEYWORDS (with context) ==="
curl -fsSL --max-time 90 "$url" \
| rg -n -C4 'level=fatal|ERROR|Error:|FAIL|failed|timed out|context deadline exceeded|connection reset|EOF|No such host|Temporary failure|i/o timeout|dial tcp|TLS handshake timeout|panic:|assertion|expect\(' || true
Length of output: 45429

/retest
IBM Cloud IPI requires credentialsMode: Manual but ccoctl needs iam.policy.create which we don't have. Instead, generate the ibm-cloud-credentials secret manifests directly using the API key for all required namespaces (CCM, machine-api, image-registry, ingress, CSI). This fixes the CCM FailedMount error that prevented node initialization and caused the entire cluster to be broken. Co-authored-by: Cursor <cursoragent@cursor.com>
The cleanup step was positioned before the common bootstrap steps, so it only ran when IPI-specific steps failed. Now it runs at the end of the job and destroys the cluster if ANY step failed (HCO, ARC, health check, etc.). No more orphaned clusters from partial setup failures. Co-authored-by: Cursor <cursoragent@cursor.com>
CI Triage (auto-retest 1/5 for
Classification: Infrastructure failure
Log excerpt (infrastructure pattern matched)
/retest
📝 Description
Continues CNV-74265 / PR #3713: move KubeVirt plugin E2E from Prow (new AWS cluster per run) to a persistent IBM Cloud ROKS hot cluster with GitHub Actions + ARC.
This PR rebases Scott's hot-cluster POC onto current `main` and hardens it for upstream use:
- `hot-cluster-e2e.yml` (PR + manual entry) and `hot-cluster-e2e-run.yml` (build plugin, provision test stack, run tests); removed duplicate `poc-e2e-ci-test*.yml`
- `IC_KEY` for IBM Cloud (replacing `IBM_CLOUD_API_KEY`)
- `oc` install via `install-oc-client.sh`, HCO InstallPlan polling, fail-fast runner checks, per-run plugin image tags

Cluster setup/teardown workflows (`ibmc-cluster-*.yml`), ARC, ci-env-controller, and ci-test-stack Helm charts are included from the POC. See ci-scripts/README.md for architecture and testing steps.

🔗 Links
Jira ticket: https://redhat.atlassian.net/browse/CNV-74265
Related: #3713
🎥 Demo
WIP — first green run pending IBM Cloud IAM completion.
Latest test runs (2026-06-24):
See testing update comment for IAM details and next steps.
Test plan
- `main`
- `IC_KEY`
- `ci-env-controller` is running on the cluster (`oc get deploy -n ci-env`)

Summary by CodeRabbit
New Features
Infrastructure & Automation
Tests
Documentation
Chores